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ABSTRACT 

Future photometric supernova surveys will produce vastly more candidates than can be 
followed up spectroscopically, highlighting the need for effective classification methods 
based on lightcurves alone. Here we introduce boosting and kernel density estimation 
techniques which have minimal astrophysical input and compare their performance on 
20,000 simulated Dark Energy Survey lightcurves. We demonstrate that these methods 
are comparable to the best template fitting methods currently used, and in particular 
do not require the redshift of the host galaxy or candidate. However both methods 
require a training sample that is representative of the full population, so typical spec- 
troscopic supernova subsamples will lead to poor performance. To enable the full 
potential of such blind methods, we recommend that representative training samples 
should be used and so specific attention should be given to their creation in the design 
phase of future photometric surveys. 

Key words: photometric, sn typing, boosting, KDE. 



1 INTRODUCTION 



Supernovae (SNela) provided the first widely 
evidence for cosmic acceleration in the late 

Based 



Type la 
accepted 

1990's ( |Riess et aT||1998| |Perlmutter et al.||1999 1 
on small numbers of predominantly spectroscopically- 
confirmed SNela, those results have been confirmed by in- 



dependent analyses ( 


Eisenstein et al. 


2005 


Percival et al. 


2007 


Mantz et al.|2010 Fu et al. 2008 


Giannantonio et al. 


2008 


Percival et al. 2010| Komatsu et al. 2010 1 and by a 



series of steadily improving SNela surveys. These modern 
SNela surveys have acquired about an order of magnitude 
more SNela than those early offerings, now covering red- 



shifts out toz~ 1.5 l 


Filippenko et al.||2001 


Aldering et al. 


2002 


|Astier et al.|2006| Clocchiatti et al.|2006||Kessler et al. 


2009 


Folatelli et al. 20101. In addition, these surveys now 



have excellent lightcurve coverage with rolling search strate- 
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gies and multi-frequency lightcurve data with significantly 
better control of photometric errors due to the use of a single 
telescope to acquire the data in each major survey. 

The next generation of SNela surveys will be inte- 
grated into major photometric surveys, such as the Dark 



ration 



Energy Survey (DES) (The Dark Energy Survey Collabo- 



2005] ), PanS TARRS ([Kaiser fc Pan-STARRS Team 
2005 b," SkyMapper (Schmidt et al.||2005| ) and LSST ( jTyson" 
2002|. These next generation surveys promise to catalyse 



a new revolution in SNIa research due to the sheer num- 
ber of high-quality SNIa candidates that will be discovered: 
tens of thousands and perhaps millions of good SNIa can- 
didates over the decade 2013-2023. Spectroscopic followup 
will probably be limited to a very narrow subset of these 
candidates and so finding ways to best choose the followup 
subset to utilise the photometric data is a key challenge in 
SN cosmology for the coming decade. 

In this paper we are interested in methods that can 
be used to accurately identify SNela from their lightcurves 
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alone, that is, their variation in brightness in different colour 
bands as a function of time. This is a departure from tradi- 
tional studies of SNela where all SNe used in cosmological 
parameter estimation studies have had their type confirmed 
via one or more spectra. 

There are two ways that one can imagine using photo- 
metric candidates. The first approach is to use all the SNe, 
irrespective of how likely they are to actually be a SNIa. 
This is the approach exemplified by the BEAMS formal- 
ism, which accounts for the contamination from non-la SN 



data using the appropriate Bayesian framework (Kunz et al 



2007). The more conservative approach is to try to classify 
the candidates into la, Ibc or II supernovae, and then only 
use those objects that are believed to be SNela above some 
threshold of confidence. 

The origin of this paper was the Supernova Photo- 



metric Classification Challenge (SNPCC) run by Kessler 
et all |20To|. The SNPCC provided a simulated spectro- 



scopic training data sample of approximately 1000 known 
supernovae. The challenge was then to predict the types 
of approximately 20 000 other objects from their lightcurves 
alone. The challenge is now over, and the results from the dif- 



ferent contributors are summarized in Kessler et al. (2010b I. 

In this paper we present the details of a number of ap- 
proaches to this problem, and their successes and failures. 
In Section [3] we discuss methods we have implemented to 
go from multi-band lightcurves to probabilities while in Sec- 
tion [4] we discuss the performance of the methods in the 
SNPCC. In particular we highlight how a non-representative 
training sample negatively affects the performance of the 
different algorithms. Finally we conclude with recommenda- 
tions for the future. 



36 
8 

-20 
-48 
-76 
47 



-1 
-17 



J 1 








93 -34 25 84 143 




lit 







14 












Id 


7 






L i 








U 












111 










r 


J 


l r 




















-14 
















93 


-34 


25 




84 


143 


32 












\i 


22 














12 

2 




y 










J 


-8 












I 




74 


-32 


10 




52 


94 




27 






11 












-21 






37 







-93 -34 25 84 143 -93 -34 25 84 143 



2 THE LIGHTCURVE DATA 

2.1 The Supernova Challenge Data 

The data used in this paper consists of ~ 20 000 simulated 
SN lightcurves with associated SN types released after the 
SNPCCQ The SNPCC dataware only relevant in our dis- 
cussion of competition scores. Our reason for using the post- 
data is that it has numerous improvements and bug-fixes and 
is a more accurate simulation. The simulation was based on 
a DES-like survey, consisting of 5 SN fields, each of 3 deg 2 , 
such that 10% of the total survey time is allocated to the 
SN survey. The SNPCC dataset consists of a mixture of SN 
types (la, II, lb, Ic), sampled randomly with proportions 
given by their expected rates as a function of redshift. 

Each simulated SN consists of flux measurements in 



the griz filters (Fukugita et al. 1996) and includes infor- 



mation about the sky-noise, point spread function, and at- 
mospheric conditions that are anticipated for the DES site. 
Distances were calculated assuming a standard ACDM cos- 
mology (f_M = 0.3, Qa = 0.7 and w = —1), with anomalous 
scatter around the Hubble diagram drawn from a Gaussian 
distribution with er m = 0.09 and applied coherently to each 



1 These post-SNPCC lightcurves are available 
http://sdssdp62.fnal.gov/sdsssn/SIMGEN_PUBLIC/ 



2 These competition lightcurves 
http:/ /www. hep. anl.gov/SNchallenge/ 



are available from 



Figure 1. (Above) A typical well-sampled SNIa lightcurve, in 
this case at redshift z = 0.694. (Below) The lightcurve of a typical 
well-sampled non-la SN at z = 0.663. Overplotted are the best- 
fitting curves using Eq. [I] 



passband. The SNPCC data includes two selection criteria. 
Each object is required to have at least one observation with 
a signal-to-noise ratio (S/N) above 5 in any filter, and must 
also have at least 5 observations after explosion. A complete 
summary of the SNPCC is given in |Kessler et al] ( |2010|b| ). 



We took part in two of the SNPCC challenges. In the 
first (+H0STZ) challenge, participants were provided with 
photometric host galaxy redshift estimates, based on simu- 
lated galaxies analysed using the methods discussed in |Oy-| 
aizu et al] (|2008[) and asked to return the type of each SN 



candidate. In the second (-H0STZ) challenge, no redshift esti- 
mates for simulated SNe were provided. Both challenges are 
considered in this paper, but with emphasis on the +H0STZ 
challenge. We did not attempt to distinguish between non-la 
sub- types (such as type II and type Ib/c SNe). 

Figure [I] shows the multi-band lightcurve data for a 
randomly selected la and non-la supernova. To these mea- 
surements, a parametric curve has been fitted as discussed 
in Section ROTI 
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2.1.1 Training Samples 

The aim of the SNPCC was for the participants to classify 
each of the simulated SNe into la or non-la (and non-la sub- 
classes if they desired) with the aim of minimising false la 
detections and maximising correct la detections. To aid this, 
a spectroscopic training sample of ~ 1000 SNe with known 
type was provided which is a simulation of expected spec- 
troscopic observations on a 4 meter class telescope with a 
limiting magnitude of r ~ 21.5, and an 8 meter class tele- 
scope with limiting i band magnitude of 23.5. Because spec- 
troscopy is harder than photometry the distribution of SNe 
in this spectroscopic sample is much brighter on average 
than the full photometric sample, and hence is not represen- 
tative of the full sample. This is a crucial point to appreciate 
and as a result in this paper we refer to this sample as the 
non-representative training sample. 

We will often compare with the results from a represen- 
tative sample, generated by spectroscopically following up a 
sample of objects that is representative of the full photomet- 
ric SN population. To produce an unbiased training sample, 
at the conclusion of the SNPCC when the types of each 
SNPCC object were revealed, we randomly selected ~ 1000 
SNe from the entire SNPCC dataset, and considered the ef- 
fect of using this as our training sample. This is referred to 
in the text as the representative training sample. We refer 
to the SNe that require classification as as the unclassified 
set. 



2.2 Post-processed Data 

2.2.1 Fitting a parameterised curve 

In the provided photometric data the number, sampling 
times, frequency and accuracy of the sampled magnitudes 
varies greatly for each supernova, as illustrated in Figure [I] 
In order to standardize the raw data we fit, by weighted least 
squares, a parameterised function to the lightcurves in each 
of the four colour bands. Our parameters are (A,(p,ip,k,a) 
and the flux in each band is taken to be[3 



F(t) = A 



exp 



t-<t> 



k~ k e k + #(t). (1) 



The five parameters to be fit in each band have the following 
interpretations: A + ip is the peak flux, <f> is the starting time 
of the explosion, k determines relative rise and decay times 
and a is a temporal stretch term, r, the time of peak flux, 
is determined by these parameters via r = k ■ a + <j>. The 
function \& is a "tail" function such that F(t) — > ip as t — > oo. 
The exact form (illustrated in Figure [2J of is: 




-0.2 0.0 0.2 0.4 0.6 



1.0 1.2 1.4 



Figure 2. The tail function which is used in fitting Eq. [T] 
Parameters (tp, </>, r) are kept fixed at (0.5, 0, 1) here. 




Figure 3. The effect of varying A on the function F(t) from low 
(dark) to high (light). We keep the parameters (A;, a, <j>, tp) fixed 
at (1, 1,0,0). 




Figure 4. The effect of varying tp on the function F(t) from low 
(dark) to high (light). We keep the parameters (k, a, <fi, A) fixed 
at (1, 1,0, 1). 



files at Cosmology at AIMS (2010 1, each containing 200 ran- 



domly selected and fitted SNe to illustrate the range of fits 
possible. With five free parameters, A, tp, (p, k and a in each 
colour band and a host redshift (in +H0STZ challenge), we 
have 21 parameters specifying each SN. We do not require 
that there be any correlation between the derived parame- 
ters in any band, e.g. between explosion time, time at peak 
or stretch. This is a natural extension to study in future 
work. 



*(«) = 





cubic spline 



-oo <t<<j> 

<P <t <T 
T < t < OO 



where the cubic spline is uniquely determined to have 
zero derivative at t = <p and t = r. The effect of each param- 
eter is illustrated in Figures [3] to [5] We have also posted two 



3 This function has a single maximum and therefore cannot fit 
examples which have a double peak. However, for the data we use 
in this paper this turns out not to be an important limitation. 



2.2.2 Sparse datasets 

About 5% of all the SNe had fewer than 8 observations in 
one or more of the four bands. To avoid overfitting, we did 
not fit these SNe with Eq. [I] Instead, these sparsely sampled 
SNe were each fit to a 5 dimensional point - the maximum 
flux in each of the four colour bands plus the host redshift. 
The KDE and boosting methods (Section^ were applied to 
these SNe in the same way as was done in the 21 dimensional 
case (Sections |4.1.1| |4,2.1[ ). Unless otherwise stated, discus- 
sions and illustrations will all reference the 95% of SNe which 
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Figure 5. The effect of varying a on the function F(t) from 
0.1 (dark) to 1.0 (light). Increasing a linearly stretches the curve 
away from the t = (f>. We keep the parameters (A, <j>, k, r) fixed at 
(1,0,1,3). 




Figure 6. The effect of varying k on the function F(t) from 0.2 
(dark) to 1.8 (light). Increasing k decreases the ratio of rise to 
decay time (Rapid rise relative to decay means low k). We keep 
the parameters (A, er, <fi, t) fixed at (1, 1.5, 0, 3). 



had 8 or more observations in all bands and hence were fit 
with 21 parameters. 



2.2.3 SALT fits 



In Section |4.4[ we consider classification methods that re- 
quire information on the distances to SNe to constrain their 
type. Distance moduli for all SNPCC SNe were derived us- 



ing the publicly available lightcurve litter S ALT2 ( Guy et al. 
[20r7f| ). Fits were carried out using the g, r and i passbands 
(i.e. z colour band data was not included). All available SNe 
were considered, which is significantly more liberal than the 
usual data-quality cuts applied during past SN cosmology 
analyses (Ke ssler et al. 12009] ). In this way, we maximized the 
number of SNe available for this work. We applied SALT2 
to 1256 SNe available in the non-representative training 
sample. Immediately, we found that 165 SNe failed to pass 
through SALT2 with the reported error of the lightcurve 
either having too low a S/N or missing g-band data. We 
did not investigate these errors further and simply exclude 
these SNe. Furthermore, when the S/N is low, SALT2 fits 
some SNe but returns a default upper limit magnitude of 99 
and is unable to produce meaningful parameters from the 
lightcurve fit. This affected 62 SNe in the training sample, 
which were also removed from the sample. For the 1029 SNe 
that were successfully fitted, SALT2 returned a best fit value 
for four parameters M, Xq,Xi and c for each event (which 
relate the peak magnitude and stretch/colour corrections to 
the lightcurve). The best-fit la model lightcurve was also re- 
turned in the observer frame, which we used to calculate the 
X 2 value for each SN in each passband (g,r,i) which are used 
in Section|4~4"]to classify SNe. Distance moduli are calculated 
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Figure 7. Hubble diagrams for the 2 training samples considered 
in this paper. SNela are shown as red triangles, while non-la SNe 
are plotted as blue squares. Also shown is the best-fit cosmology 
to each SNIa sample. (Above) The representative training sample, 
with Q m = 0.23. (Below) The non-representative training sample, 
as provided for the SNPCC, with U m = 0.3. 



with 



fi — (tub — M) + aii — /3c 



(2) 



where we used values of a = 0.1, /3 = 2.77 and M = 30.1 
to calculate the distance moduli, as discussed in |LampeitT| 
et al. (20101. These values are consistent with those found 



in other analyses and were not expected to significantly af- 
fect our results. Figure [7] shows the Hubble diagrams for 
the two training samples considered in this analysis. Also 
shown is the best-fit cosmology to each la dataset assum- 
ing a flat ACDM model. The non-representative training 
sample has a best-fit value of fl m = 0.30 compared to a 
value of Q m = 0.23 for the representative training sam- 
ple. In the non-representative training sample, non-la SNe 
are predominately found at lower redshifts than the rep- 
resentative training sample due to the effective magnitude 
cuts coming from the spectroscopic requirement of the non- 
representative sample. 



3 NEW CLASSIFICATION METHODS 

We now describe in very general terms the classification al- 
gorithms we have used to facilitate application to other areas 
of cosmology and astrophysics. In order to classify a given 
object Y as either la or non-la, one would like the posterior 
probabilities P (Y = Ia|x) and P(Y = non-Ia|:r) = 1-P(Y = 
Ia|a;). Here x are the parameters or features that character- 
ize the supernova. Knowing these posterior probabilities is 
equivalent to knowing the odds: 



odds(x) = 



P(Y = I&\x) 
P(y = non-Ia|x) ' 

Now one classifies Y as a la for example if odds(x) > 1, 
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Figure 8. Schematic figure illustrating the idea of a KDE in one 
dimension. The training data points are shown as dark points 
with arrows. The Gaussian kernels are shown together with the 
sum of the kernels. Note that the KDE is not normalised in this 
figure and is thus close to what we actually use in this paper. 



i.e if P(y = Ia|a;) > 0.5. The two methods we discuss in this 
Section approximate the odds in different ways: 

1) Kernel Density Estimation estimates P(a;|y = fa) 
and P(sjF = non-fa), the density of the features in classes 
fa and non-fa respectively, and then uses Bayes formula to 
give 

P(x\Y = fa) • P(V = fa) 



odds(x) 



P(x| Y = non-fa) ■ P(Y = non-fa) ' 



2) Boosting directly estimates odds(x) through regres- 
sion methods, as a sum of small trees built by a type of 
functional gradient descent. 

These methods are discussed in detail below. 



3.1 Kernel Density Estimation (KDE) 

Kernel Density Estimation (KDE) is a non-parametric 
method for estimating the probability density function (pdf ) 
of a sequence of observables. Within this paper, the proba- 
bility densities of the post-processed data described in Sec- 
tions |2~2~T1 - l2~2~3l are used for classification. Pdfs are use- 
ful as we may base a classification rule upon the relative 
probabilities that a candidate supernova is either type fa 
or not type la. Such a classification rule will require both 
the fa and the non-la probability densities for the observed 
SN data. KDE enables us to derive these pdfs in a fairly 
model-independent manner, as we now discuss. 

Suppose we have a set of d observables and that we 
would like to estimate the value of the pdf at a point x in 
this d-dimensional space. Given a training set with n ob- 
servations, i.e. n points Xi in this d-dimensional space, the 
Kernel Density Estimate (KDE) is given by 



fh(x) 



1 " f 

»=i 



x — Xi 
h 



(3) 



where fh (x) is the KDE, Xi is the i-th training observation, 
Ki is the kernel function for the i-th training observation 
and h is the global kernel bandwidth, h is a tuning param- 
eter: the kernels become more "peaked" about the training 



CD' 




Figure 9. A realisation of fifty points from an unusal distribution. 
Around each observed point a kernel is constructed. The axes of 
each kernel are the eigenvalues of the point's (2 X 2) matrix 
(Eq.[4]|. Each £j is the covariance matrix of the nearest I points 
multiplied by the global bandwidth, h. Here h = 0.6 and i = 10. 



observations as h becomes smaller. The optimal bandwidth 
may be obtained by cross-validation (see Appendix [A|. The 
choice of kernel is arbitrary, except that any proposed kernel 
should satisfy the following two conditions: 

• J K(x)dx = f 

• K(-x) = K{x) 

The first condition ensures that the KDE integrates to unity 
and that all observations carry equal weight, whilst the sec- 
ond condition ensures that the KDE is unbiased and is cen- 
tered about one of the n d-dimensional training data points. 
The basic idea of the KDE method is illustrated in Figure [8] 
in a simple f D example. A commonly used kernel (and the 
kernel that we will use throughout this paper) is a multi- 
variate Gaussian, normalised to unit volume: 



A | ^ T ^,E l 



1 I 1 x-Xi 
— - exp < 



(4) 



x-Xi 



Here x and Xi are d dimensional vectors and Hi is a d x d 
covariance matrix that changes the orientation and shape of 
the kernel around each training observation i; for example 
the covariance matrix Ej can be estimated from the nearest 
t neighbours of a training data point, which is what we do, 
as described in Section |4.f | and as illustrated in Figure [9] 
This provides the possibility of adapting the kernel to local 
variation, fn contrast the bandwidth parameter h affects the 
global behaviour of the kernels. While it is more common to 
choose the covariances to be equal, for the SNPCC and the 
current application this would have been a bad choice (as 



described in Section 4.1 1 
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3.1.1 Integration over data errors 



In order to classify a supernova with lightcurve measure- 
ments x, we must evaluate the KDE at x. However in our 
case we are not sure where x lies in parameter space as the 
lightcurve measurements have errors and are not perfectly 
sampled. 

Using a Gaussian kernel, we write the KDE as 



h* K { h 



(5) 



For simplicity we suppress vector notation but all quantities 
(other than h) are d dimensional vectors or matrices, and 
the index i runs over the points in the training set. 

Now assume that the location of a point in the d dimen- 
sional space is not known exactly and is instead given by a 
Gaussian pdf. We take the mean to be x and the covariance 
matrix to be Y. The KDE value is then given by integrating 
the KDE over the unknown pdf of the point being classified: 



dzK (z - x, Y) f(z) = 



(6) 



-T-f 

h d J 



dzK (z~x,Y)K 



h 



Si 



We notice that this reverts back to the original value if K 
is a delta-function located at x. Further, the function be- 
ing integrated is a product of two Gaussians, which is itself 
another Gaussian. The KDE value then simplifies to 

-Xi 



n ^ h d 



■ t T,i + h- 2d Y 



(7) 



i.e. the KDE kernels simply have an increased variance, given 
by the sum of their covariance matrix and the covariance 
matrix of the point being evaluated, scaled by h~ 2d . The 
importance of including this increased variance for uncer- 
tain observations should not be ignored, especially when the 
variances of the points being classified are large (as is the 
case in this paper). Correctly implementing Eq. Q can sig- 
nificantly improve classification performance. In Section |4.1| 
we compare analyses on the SN data including and ignoring 
the covariance Y. 



3.2 Boosting 



Boosting ( jFreund fc"S chapirc ( 1995|) is a learning algorithm 
for classification. Until recently the most popular boost- 



ing algorithm was AdaBoost (Freund & Schapire (19971). 
AdaBoost works by combining weak-classifiers into a com- 
mittee, whose combined decision is significantly better than 
that of individual weak-classifiers. The precise workings be- 
hind AdaBoost's success remained hazy until it was shown 



( Friedman et al. ( 2000 1) that boosting produces the powerful 
committee by sequentially adding together weak-classifiers 
calculated by steepest descent. The further ideas of slow 



learning (Friedman 20011 and bagging ( |Friedman| (2002)) 
were later introduced into boosting, culminating eventually 
in the Gradient Boosting Machine (GBM) algorithm. The 
algorithm, implemented as a package in the s tatisti cal pro- 
gramming language iQ is described in Section 



3.2.3 



A brief 



4 R and its associated packages can be downloaded from 
http:/ /www.r-project.org. 
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Figure 10. (Above) A tree of depth 2 for classifying an object 
into one of 2 2 regions. (Below) The tree domain containing 2 2 
distinct regions as defined by the tree. 



discussion of trees and loss functions is presented in Sections 
|3 . 2 . 1 1 and |3.2.2| in preparation for the presentation of the 
GBM algorithm. 



3.2.1 Tree functions 

The most widely used weak-classifiers (a.k.a. basis func- 
tions) in boosting are trees. Trees are discontinuous func- 
tions which take discrete values in different regions of a do- 
main. That is to say, a tree T has the form: 



T(x) = 



ZK 



if x G Ri 



if x £ Rr, 



where the K distinct regions R\ • • ■ Rk together par- 
tition a?-space. The region boundaries can be described 
through the branchings of a tree, as illustrated in Figure [l0"| 
For boosting, it is common to only use trees of a very simple 
form, that is only trees with branchings of the form x^ < v, 
where x^ is one of the dimensions of £-space and v is a real 
number. In the case of the SNPCC, x are the parameters 
fitted to the lightcurves in Section [2.2. 1| 



3.2.2 Loss function for classification 

Suppose we have observed n training points, each consist- 
ing of data and type: (Xi,Ti), where the data Xi is a d- 
dimensional vector, and the type n is ±1, corresponding 
to the two classes. Suppose that we are required to find a 
function F : R d — > R which minimises the following loss 
function: 



L(F) = J2 lo § i 1 + ex P [-2^)7-*] ) . (8) 
The specific form chosen for the loss function [8] can be 
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explained by consid ering its partial de rivatives with respect 



dL 



2n 



to F(Xi). Doing so (Hastie et al 



the form of F which minimises 



2009 1, it can be shown that 



8b is given by: 



nxi) = 



# observations: X%,t = 1 
# observations: Xi,r = — 1 



(9) 



This is an approximation to half the log odds (the log 
of the odds): 



log odds = log 



P(r, = l\x = X, 



•l\x = Xi 



(10) 



This is the key result: a function which minimises the loss 
function |8| is a good approximation to half the log odds. 
A good approximation to the log odds is exactly what is 
needed for classification problems. The boosting algorithm 
aims to approximately minimise this loss function and in so 
doing arrive at an approximation of the log odds which can 
then be used for classification. 

If you have observations at every possible data point, 
you can directly approximate the log odds through Jjj). In 
reality, you will not have observations at all possible data 
points, and so cannot do this. This trivially corresponds to 
not having observed all possible lightcurves, and so needing 
to make inferences from simililar lightcurves. Boosting does 
this inference through constrained minimisation of the loss 
function, as described in the following section. 



dF„ 



(A) 



1 + exp 



[2F(^<)ts] 



2) Fit by least squares T m , the new tree: Z{ ~ T m (Xi). 
(where T m has regions R m ,i ■ ■ ■ R m ^ D fitted to minimise 
the ingroup variance: see Appendix [c] for details.) 

3) Choose constants y m ,i'"7m,2 D f° r Rm,i • ■ ■ R m ,2 D ■ 
(chosen to minimise L {Fm-i + T m )) 

4) F m 4— F m -i + T m 
•Finally, F <- F M - 

F is our final approximation to half the log odds, and it can 
now be used to classify with a simple rule of the form: 

IF F(xi) > v =>• n = 1; ELSE n = -1, 

where the optimal v depends on the Figure of Merit. 

Notice that the variable Zi, updated in step 1, is positive 
if Tj = 1 and negative if n = — 1. For this reason, when T m 
is fit to the Zi's at step 2, observations of the same type are 
more likely to fall into the same region of T m . Moreover, 
observations with large Zi's carry more weight while fitting 
T m , and hence are even more likely to be placed with objects 
of the same type. This acts to place special attention on 
unusual objects, or objects whose type is not clear. 

While values are fitted for each tree region in step 2 (as 
described in Appendix |C|, these values will not necessar- 
ily result in a reduced L(F m -i + T m ). Hence at step 3 of 
the algorithm, 7 m ,fc values are explicitly chosen to minimise 
L (Fm-i + T m ). In effect, only the tree shape is taken from 
step 2. 



3.2.3 The Gradient Boosting Machine 

The Gradient Boosting Machine ( Friedman|2001 1 works by 
sequentially adding new trees to a function F, each addition 
reducing L(F) (JSJ) and so improving the approximation of 
F to half the log odds. 

The trees, which have depth D, are appended to F at 
each of the M iterations of the GBM algorithm. Choosing 
larger M and D values results in a final L(F) nearer to the 
global minimum value Q. However, our end objective is 
not to reach the global minimum but to construct a good 
approximation to the log odds, and trees of lower depth are 
generally better suited to this end. 

Algorithm 1 (below) outlines an implementation of the 
GBM. A few subtleties have been omitted from it here, and 
we refer you to Appendix [E] for a fuller description. We rec- 
ommend watching our demonstrative animation of the algo- 
rithm while reading Algorithm 1. The animation can be be 



found at Cosmology at AIMS (20101 



Algorithm 1 - Gradient Boosting Machine 



where r = — \ n . 
n 



■Input: Xi,Ti for observations i = 1 to n. 

•Imtahse: ro(x) <s— - log 

2 1 — r 

•Initalise: Zi <s— for observations i = 1 to n. The ZiS will 
measure how much of a "misfit" each observation is. 

•Choose tree depth D and number of trees M. 

■for m = 1 to M: 

1) for i = 1 to n, update zc 



4 RESULTS 

The entries in the SNPCC were evaluated using the Figure 
of Merit (FoM): 

f(Nf a ,N* m _ Ia ) = efficiency x pseudo-purity 



Ni 

iy la 



Nt, 



V V Ja / \ JV /a + J ^non-la 



where Nf a is the number of correctly classified SNela, 
is the number of non-la SNe classified as SNela, 



and Nj° T is the total number of SNela. Had the coefficient 
of N^ on _ Ia in the denominator of the pseudo-purity term 
been 1 and not 3 the term would have been true purity, i.e. 
the proportion of SNela in the final la-classified group. How 
relevant this FoM is to cosmology is not absolutely clear, but 
it is a robust measure of how well a classification algorithm 
penalises both missed detections and false discoveries. For 



applications such as BEAMS (Kunz et al. 2007 1 a FoM which 



takes type probabilities as inputs would be more useful. 

In this section we discuss the implementation and per- 
formance of each of our methods. Unless stated otherwise, 
the scores given in this section refer to the SNPCC, while 
all figures are using the post-SNPCC data described in Sec- 
tion |2.1| Of particular interest to us is the comparison of 
results obtained when the training is done with representa- 
tive and non-representative samples. We also briefly men- 
tion applications that these methods have previously found 
in cosmology and related fields. 
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Figure 11. Ia (red crosses) and non-la (blue circles) in the non- 
representative training sample. The KDE values at calculated us- 
ing tenfold cross-validation. 



4.1 21D KDE 

4-1.1 Application 

Kernel Density Estimators have been used before in astron- 
omy for estimating the probability density function from 
a discrete or noisy data set (Fadda et al 



1998 Bissantz 



et al. 2007 Ascasibar 2010 1, identifying groups (Balogh et al 



20041 and clusters (Valtchanov et al. 2004) in galaxy sur- 



veys, and determining the timings of millisecond pulsars 
( Carstairs et al.|1991 1 and gamma-ray bursts (de Jager et al. 



1986), to name a few examples. 

In Section 12.2.11 we described how we fit the SN 
lightcurves in each of the 4 bands using the parameterised 
function |l]), resulting in twenty lightcurve parameters. With 
the addition of host redshift in the case of the +H0STZ chal- 
lenge, each SN is described by a 21 dimensional (21D) point. 
We use KDE to approximate the 2 ID Ia and non-la proba- 
bility density functions (pdfs) based on the training data. 

We allowed the 21D training points to have different 
covariance matrices, as described in Section |3.1| As previ- 
ously mentioned a single global covariance is most common 
for KDE, but in cases where a pdf has large regions of high 
and low probability, this can be problematic. In low proba- 
bility regions the kernel density will be too "spikey" while 
in high probability regions it will be too smooth. To under- 
stand this, consider what would happen if, in Figure [9] the 
ellipses were constrained to all be of the same size. Cho- 
sen too small and the low probability region would have 
"bumps", too large and the high probability region would 
lose features. The 21D points for the SNPCC are not uni- 
formly distributed, as illustrated by the cumulative plots of 
Appendix |Fj and so are susceptible to this problem. Using 
cross-validation we chose t = 10 and h = 0.6 (using the 



notation from Section 3.1 1 



Having constructed two KDEs from a training sample, 
each unclassified SN may be classified as follows: 

(i) Fit Eq. [I] to each of the four lightcurves thus obtaining a 
21D point for the candidate. 

(ii) Evaluate the Ia and non-la kernel probabilities derived 
from the training sample at the 21D point, and then evaluate 
the odds. 



In cases where one or both of the KDEs are a poor rep- 
resentation of the underlying pdf, it may be preferable to 
modify step (iii). For example if one of the KDEs is partic- 
ularly inaccurate, one may prefer to classify by using only 
the other KDE. For the SNPCC leaving step (iii) unchanged 
was advisable, as can be deduced from Figure [TT] The lines 
in Figure [TTJ are lines of constant odds. If KDEs are accu- 
rate approximations to pdfs, a line of constant odds is op- 
timal for discriminating between la's (below the line) and 
non-la's (above the line), irrespective of the FoM used. Fur- 
thermore, if the KDEs are accurate approximations to pdfs, 
there should be an equal number of la's and non-la's on the 
line odds = 1 and 1000 times more la's as non-la's on the 
line odds = 1000. This is roughly observed in Figure [TT| and 
so we can proceed to choose the odds line which maximizes 
the SNPCC FoM. 

For the entry in the SNPCC, we failed to include the 
parameter covariance matrices when calculating KDE val- 
ues (in effect, we set Y to be a matrix of zeros in Eq. [7|. 
Our final score suffered as a result - the benefit of correctly 
implementing the calculation |7| is illustrated in Figure [13} 
where we see from both the histograms and the cumulative 
plots an increased separation between la's and non-la's when 
Eq. [7]) is correctly implemented. We find a 15% increase in 
score when correctly implemented on the post-SNPCC data. 
The KDE method still obtained the second and third highest 
scores in the -H0STZ and +H0STZ competitions respectively, 
with scores of 0.37 and 0.39. Of interest is that the 20D KDE 
(-H0STZ) is almost as good at classifying as the 21D KDE 
(+H0STZ). The winning competition scores (Kessler et al. 
2010b| were 0.51 (-H0STZ) and 0.53 (+H0STZ). 



4-1.2 Non-representative vs representative 

As with all of our methods, we constructed classifiers using 
both the non-representative sample provided and a repre- 
sentative sample of equal size, as described in Section [2. 1.1 1 
In each case, the remaining unclassified SNe were used as a 
test of the performance of the classifier. 

Figure [12] carries useful information about the perfor- 
mance of the non-representatively trained KDEs and repre- 
sentatively trained KDEs. For example, the efficiency of clas- 
sifying la's with a log odds threshold of 2 is simply the cumu- 
lative value of the unclassified la's (solid red) at log odds — 2. 
For both representatively and non-representatively trained 
KDEs this is about 0.75, meaning that about 75% of SNela 
are correctly classified when a threshold of log odds = 2 is 
used. 

To obtain high purity, the log odds threshold must be 
chosen such that the non-la cumulative frequency is low 
compared to the Ia cumulative frequency. To obtain high 
efficiency, the log odds threshold must be chosen such that 
the Ia cumulative frequency is high. Putting these together, 
to obtain both high purity and high efficiency, a log odds 
threshold must be found at which the non-la cumulative 
frequency is low and the Ia cumulative frequency is high. 

The dashed lines in Figure [12] are the cumulatives of 
the training data using tenfold cross-validation. In the case 
of representative training, we see that these are accurate 
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Figure 12. The cumulative frequency of log odds for non-la 
(blue) and la (red) SNe, for the training (dashed) and test (solid) 
samples. The training log odds were calculated using tenfold cross- 
validation. (Above) Using non-representative training and (Be- 
low) using representative training. 
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Figure 13. (Above) Histograms and (Below) cumulative plots of 
the 21D (representatively constructed) KDE log odds. (Left) The 
parameter covariance matrix is not included in KDE evaluation 
as proposed in Section |3.1.1| (Right) The parameter covariance 
matrix is included in KDE evaluation. 



predictors of the true cumulatives. But in the case of non- 
representative training, the non-la cumulatives of training 
and unclassified SNe are vastly different. If in the case of 
non-representative training one assumed that the training 
sample were in fact representative, one would predict a non- 
la misclassification rate of under of 10% using a log odds 
cutoff of 1. In reality it is 30%. Such dangerous predictions 
are impossible to make if a representative sample is used in 
KDE construction, as illustrated by the hugging of the solid 
lines to the dotted lines. 



0.6 

0.5 

2i 0.4 
o 

£0.3 
0.2 
0.1 
0.0 



• rep. (1:3) 

• non-rep. 
► rep. (1:1) 

X +HOSTZ 



10- 



10 J 



10 4 



Training sample size 



Figure 14. (Small black circles) The score obtained by boosting 
when trained with random repreresentative samples of varying 
size (100 to 6000 SNe). (Large red circle) Training on the given 
non-representative sample. (Blue cross) The score obtained in 
the +H0STZ competition. (Green triangle) The performance when 
trained with a "random" sample with non-random la: non-la ratio 
of 1:1 as opposed to true ratio Ia:non-Ia ~ 1:3. 



4.2 Boosting 

4-2.1 Application 

Boosting has been used in particle physics, for example by 



the MiniBooNE neutrino oscillation experiment (Roe et al 



2005) and is implemented in the photometric redshift pack- 
age AborZ ( jGerdes et aT1|2010| ). In the SNPCC we applied 
boosting to the twenty fitted lightcurve parameters for the 
-H0STZ competition, and the twenty-one parameters for the 
+H0STZ competition. Using tenfold cross-validation we chose 



to use 4000 trees to maximize the FoM (111. We chose the 



learning rate to be 0.05 and the bagging fraction to be 0.5 
(these parameters are described in Appendix [eJ. 

During the training phase of the SNPCC we expected, 
based on the idea that the training sample was represen- 
tative, that boosting would significantly outperform the 
21D KDE. In reality boosting performed more poorly than 
the 21D KDE, obtaining scor es of 0.20 (-HDSTZ) and 0.25 
(+H0STZ) (Kcss ler et al.|2010b I strongly suggesting that the 
21D KDE method is more robust to biases in the training 
set than boosting. 

In the case of the post-SNPCC data, the score ob- 
tained with non-representative training is even lower (0.15) 
(+H0STZ) due to bugs in the original SNPCC data such as too 
dim non-las which made classification easier, as described in 



Kessler et al. (2010b I. As a result comparison of scores in 



this paper with those in the competition cannot be made 
directly. 



4-2.2 Non-representative vs representative 

Our failure to correctly predict our score in the SNPCC 
was a result of the biases in the training sample. Boosting 
appears to be even more sensitive to training sample bias 
than the 21D KDE method. This is illustrated by the large 
deviation in Figure [15] of the unclassified non-la curve from 
the training non-la curve with non-representative training. 
While boosting is more sensitive to bias in the training 
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sample than the 21D KDE, it is a superior classifier when 
a representative training sample is used. This is illustrated 
in Figure [15] by the large vertical separation between non- 
la and la cumulative curves when a representative sample 
is used. The vertical separation between the la and non-la 
curves is larger in the case of the boosting than the 2 ID 
KDE, resulting in a lower contamination rate and higher 
efficiency when boosting is used. 

We see from Figure[l4]that training with 1000 represen- 
tative SNe results in a score 3 times greater than training 
with 1000 non-representative SNe. We also see from Fig- 
ure[T4]that training with a non-representative sample of size 
1000 can be matched by training with only 50 representative 
SNe. The score obtained when 500 representative la and 500 
representative non-la SNe are used for training, as opposed 
to the truly representative case where the Ia:non-Ia training 
ratio is 1:3, is only slightly higher; the advantage of extra 
la's at the cost of non-la's is marginal. 

We did not include the parameter covariance matrices in 
any way in boosting. It is not clear how this inclusion would 
best be done, but the noticeable improvement to the 2 ID 
KDE score when the covariance is included suggests that 
it is worthwhile considering this question for future imple- 
mentations. Two possibilities are a) 'supersampling' - con- 
verting each training point into 100 training points drawn 
from a distribution with covariance given by the parameter 
covariance matrix, and b) including the covariance matrix 
determinant as a 22nd boosting parameter. 

We find that with boosting if a non- representative train- 
ing sample is used the cumulative frequency lines of the un- 
classified SNe do not follow those of the training sample. 
On the other hand if a representative sample is used, ten- 
fold cross-validation provides accurate predictions for the 
unclassified SNe boosting values, as illustrated by the close 
hugging of training and unclassified cumulative lines in Fig- 
ure [m 

We see that boosting the 21D lightcurve parameters 
with a representative sample results in a robust photomet- 
ric classifier. To illustrate this point we have created on on- 
line archive of 200 randomly selected unclassified SNe, and 
labeled them according to boosting's output |Cosmology at| 
|AIMS|p010l >. In some cases it is difficult to identify obvious 
la or non-la features, yet the algorithm classifies correctly. 
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Figure 15. Boosting values obtained using (Above) the non- 
representative training sample and (Below) the representative 
training sample. The boosting values are approximations to 
~ log odds. 
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Figure 16. Cumulative plot of redshift, non-representative train- 
ing (dashed) vs unclassified (solid) and Ia (red, thick) vs non-la 
(blue, thin). 



4.3 Parameter importance 

One advantage of the boosting algorithm is its ability to 
quantify the importance of parameters in classification (see 
Appendix [P] for details). In this section we look at these 
quantities in an effort to discover which fitted parameters 
are most useful for classification. We also ask which are the 
parameters that distinguish the non-representative training 
sample from the representative training sample, i.e. what 
makes the non-representative la's and non-la's a biased sam- 
ple. We answer this question by performing boosting on a 
sample of representative and non-representative la's, as if 
the SNPCC had been a competition to determine if a SN 
attains a spectrum or not. 

Figure [T7] illustrates which parameters are most useful 
in distinguishing Ia from non-la in the representative train- 
ing sample. One interesting feature illustrated in Figure [17] 
is that every parameter appears to carry information. 



The third most important parameter (after redshift and 
and A in z-band) is the parameter k in the i-band. To in- 
terpret this piece of information, we first see in Figure |F3| 
that non-la SNe have on average lower k values than la's. 
From this we then infer from Figure|6]that la's have a higher 
rise-time to decay-time ratio than non-la SNe. 

The equivalent figure for the non-representative train- 



ing (Figure Gl in Appendix [GJ) paints a similar picture with 
one noticeable difference: The information for distinguishing 
between Ia and non-la SNe in the non-representative train- 
ing sample is carried almost exclusively in the r-band. 

We now turn to the comparison of representative and 
non-representative SNe. Figure [18] suggests that the most 
biased parameter in the non-representative training sample 
is redshift. This is not surprising given that we know that the 
non-representative SNela are at lower redshift than the true 
Ia population (Figure [l6| . Indeed, we see from Figure [T6| 
70% of Ia SNe in the non-representative training set are 
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Figure 17. The importance of each of the 21 parameters in clas- 
sifying SNe as la (or not) using boosting on the representative 
training sample. 
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Figure 18. The importance of parameters in distinguishing rep- 
resentative from non-represesntative SNela using boosting. 



at a redshift of less than 0.6, while only 20% of las in the 
unclassified set are within this redshift. 



In the case of non-la's SNe (Figure G2 in Appendix |G 



boosting allocates the majority of the bias in the non- 
representative sample to the A's. This is also unsurprising 
given that we are more likely to obtain a spectrum from 
bright objects than dim objects. It is not clear to us why 
boosting designates non-la bias to the A's and la bias to 
redshift. 



Figure 19. 3D contours of the difference between the la and non- 
la Hubble diagram KDEs as a function of redshift and distance 
modulus (fi) together with the actual non-representative training 
data used to produce the KDEs. The data used to construct the 
KDEs are also shown: la data as red circles and non-la data as 
blue crosses. There is a clear offset in the two KDEs reflecting 
the fact that in this training data the non-la's are fainter, hence 
predominantly at lower redshift and with a much larger scatter 
than the la's. 



4.4 Hubble KDE 

4-4-1 Applications 

An alternative method for using the idea of KDEs is to use 
the SALT2 light curve fitter (with a = 0.1 and f3 = 2.77 as 
in 



Hicken et al. ( 2009 1) to estimate distance moduli, in and 



errors <r; for all the objects in both the training and test 
data, assuming that all the data are SNela. We can then 
construct two 2D KDEs for the training data: one consist- 
ing of all the known SNela and one from all the non-la data. 
Each kernel is normalised to have a total volume of unity 
and we use a slight modification of the standard KDE for- 
malism because we do not normalise the KDE. Instead the 
heights of the summed KDEs are proportional to the num- 
ber of SNela and non-la respectively. In this way we include 
prior information related to the supernova rates. A redshift 
range where there are many more SNIa than non-la will 
automatically tend to lead to a larger la KDE as a result. 
Of course, this does increase sensitivity of the method to 
biases/no n-representativity in the the training sample rates. 

The 2D Gaussian kernel chosen for the Hubble KDE 
algorithm had a fixed bandwidth (standard deviation) in 
the redshift direction of 0.05 (chosen simply to avoid be- 
ing too peaked but small enough that the distance modulus 
does not change significantly across it) while the bandwidth 
in the \x direction was determined by the error <r;, on the 
distance modulus coming from SALT2, and also includes a 
0.12 mag intrinsic dispersion error as usual. This means that 
points with large errors contribute very broad, low ampli- 
tude humps to the final KDE, while points with small er- 
rors are much more peaked, reflecting our confidence in that 
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point. For illustrative purposes we plot the difference KDE/ a 
-KDE non _ia of the two KDEs in Figure [l9| Positive values 
correspond to places where the la KDE dominates, negative 
values to where the non-la KDE dominates. In addition we 
plot the training data used to construct the KDEs. 

Classification using these KDEs is then simple. For any 
candidate object, we run it through SALT2 to give an esti- 
mated jj, and cr. We can only use this approach on the data 
with a redshift estimate, z, unlike the 21D KDE and boost- 
ing algorithms which do not require a redshift. We then sim- 
ply find the values of the two KDEs at that (fi, z) to yield 
probabilities of the object being a la or non-la. As in the 
other KDE method, one should fold in the error a on the 
candidates which, assuming Gaussianity, is simple, as de- 
scribed in Section |3.1.1| The result of this analysis is that 
each candidate has a pair of probabilities: (Pi a , Pnon-ia) 
that can be used to classify the candidate. 



4-4-2 Non-representative training sample 

We applied this methodology to the whole sample of un- 
known SNe supplied. In total, we started with 17065 SNe 
and lost 4578 SNe as junk because of SALT2 failures previ- 
ously mentioned (of which 2619 were complete failures and 
1959 failed to return meaningful parameters from the la 
lightcurve fit), leaving 12487 SNe for further analysis. 

Essentially this Hubble KDE approach simply checks 
whether or not an object lies close to the true cosmology 
curve on the Hubble diagram (defined by the la KDE) at 
that redshift. However, there are many non-la's which lie 
close to the true cosmology curve. As a result one either has 
to be very strict with cuts (and therefore lose many true 
la's) or one has to accept a large number of false positives: 
non-la's that are classified as la's. 

Because there are so many non-la's this, and similar 
Hubble-diagram based methods (such as the Portsmouth 
entry to the SNPCC), are less competitive as classifiers. In 
addition they also require a redshift estimate for the SNe 
and are hence doubly inferior compared with the 21D KDE 
and boosting. 



4.5 Combining 21D and Hubble KDEs 

In Section [4. 1| we described the 2 ID KDE approach, and in 
Section |4.4| we described the Hubble KDE approach. In this 
section, we describe how we combined these approaches. As 
outlined in Appendix[B] there are several ways of combining 
odds from different algorithms to construct a better com- 
bined classifier. For our combination entries in the SNPCC, 
we constrained our classifier to be of the form: 



(Hubble odds) Q ■ (21D oddsf > r). 



(11) 



This corresponds to a straight line in Figure [20] The 
scores for the combination entry was 0.28. Surprisingly, this 
was less than the score obtained using the 21D KDE alone, 
and so we believe that the line chosen for the SNPCC was 
poor. A straight line does seem to be a good choice for the 
distribution of values in Figure |20| but perhaps a better 
choice would be of the form: 



A pure 21D odds classifier would rely on a vertical deci- 
sion line, and a pure Hubble odds classifier would rely on a 
horizontal line, but it is clear from Figure [20]that a classifier 
of the form 11 (dashed) or 12 (solid) should work better. Fig- 
ure [20] shows the separation of la's and non-la's that come 
from using the Hubble KDE odds and 2 ID KDE odds with 
the integration of errors presented in Section |3.1.1| The op- 



timal lines of forms ( 11 \ and ( 12 \ result in scores of 0.24 and 



Hubble odds > 71 and 2 ID odds > 72. 



(12) 



0.22 respectively in the case of non-representative training 
and 0.45 and 0.42 respectively in the case of representative 
training. These scores are calculated using a purely 2 ID odds 
classification for the ~ 8000 SNe without SALT2 fits, and 
a 21D-Hubble combination for the remaining ~ 12500 SNe 
with SALT2 fits. As with boosting, the 21D KDE classifier 
is significantly worse using the post-SNPCC data as previ- 
ously discussed in section [4.2.2| and so comparison between 
these post-SNPCC scores and other SNPCC scores should 
not be made until further analyses have been done. 

To be in the top-right corner of Figure [20| and therefore 
be classified as la, requires that a candidate must simultane- 
ously lie close to the true cosmology distance modulus and 
have multiband lightcurves that have the right shape; a very 
natural approach to SNIa classification. It would be interest- 
ing to combine the Hubble odds with 21D boosting instead 
of 21D KDE, as boosting the twenty parameters produces 
better results, as seen in Section [4. 2| 

An obvious extension, if one wanted to combine the out- 
puts from more than two classifiers, would be to use them as 
inputs to a new boosting analysis. The odds from the 21D 
KDE, the Hubble KDE, the 21D boosting, and indeed any 
classifier of sufficient ability can be used as weak classifiers 
in boosting. A reason to exercise caution in using boosting 
or a neural network as a final classifier in this way is the pos- 
sibility of overtraining, but this can be prevented by using 
tenfold cross-validation. 



5 DISCUSSION AND CONCLUSIONS 

In this paper we have discussed the problem of classifying 
supernovae (SNe) into sub-classes (Type la or non-la) based 
on photometric lightcurve data alone, that is, multi-band 
fluxes as a function of time. This will be necessary for future 
surveys which will detect vastly more candidates than will 
be possible to follow up spectroscopically. 

We have investigated two novel classes of classification 
algorithms, Kernel Density Estimation (KDE) and boosting, 
and applied them to simulated SNe lightcurve data, finding 
that the methods performed impressively as long as they 
were trained on a representative sample. Using the KDE 
approach, we considered both a 21 dimensional case based 
on lightcurve parameters from all bands and a 2 dimensional 
version based on fits to the Hubble diagram, using redshift 
information and an estimate of the distance modulus ob- 
tained using standard lightcurve fitting software. 

A key issue for the classification methods we used was 
the issue of the training data sets. We compared the results 
based on training on two very different data sets: the first, a 
non-representative set, mimicking the kind of spectroscopic 
sample available as part of the follow-up program of a typical 
current-generation SN survey. The second was a representa- 
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Figure 20. SNe of type la (cross) and non-la (circle), located 
according to their 21D odds (x-axis) and Hubble odds (y-axis). 
(Above) Odds were calculated from KDEs constructed using the 
non-representative training sample. (Below) A corresponding plot 
where KDEs were constructed with the representative training 
sample. We see that the separation obtained is smaller when non- 
representative training is used, and indeed the score obtained in 
the non-representative case is significantly lower. Note that the 
SNe in this figure are a random sample of the ~ 12500 with a 
meaningful SALT2 fit. 



tive sample of the same size where training objects were 
selected at random from the full sample. 

In general we found that the training on the represen- 
tative sample produced exceptionally good results and that 
cross-validation on the training sample was able to accu- 
rately predict the purity and efficiency of the method on 
the full sample. On the other hand, training on the non- 
representative sample lead to relatively poor performance 
on the full data set. The importance of having an unbi- 
ased, representative sample is illustrated by the fact that 
for boosting, representative samples larger than about 50 
objects outperformed the full non-representative sample of 
1000 objects, as shown in Figure |l4| 

Our primary result and recommendation therefore is 
that boosting and KDE are powerful methods for SN clas- 
sification, with remarkably little astrophysical input. How- 
ever, they require training samples that are as unbiased and 
representative as possible. Further, we found that a small 
unbiased training sample outperforms a much larger, but 
biased, training sample. 

Our other main result is that neither boosting nor the 
21D KDE method suffered particularly when the SN redshift 
information was unavailable. This is particularly gratifying 
given that accurate SN/host galaxy redshifts will not be 
available for most candidates in the future and that meth- 
ods based on the Hubble diagram critically require redshift 
information to perform successfully. 
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Figure 21. Hubble diagrams for the boosting results using the 
representative training sample. (Above) Objects that were cor- 
rectly identified by the boosting method. SNela are plotted as 
red triangles, with non-la SNe shown as blue squares. (Below) 
SNe that were incorrectly typed by boosting. SNela that were 
considered to be non-la SNe by boosting are shown as green trian- 
gles, with incorrectly typed non-la SNe shown as purple squares. 
Overplotted on each graph is the best fitting cosmological model 
inferred from the representative training sample. 



While the algorithms we have presented were successful, 
there are modifications to our boosting implementation that 
should be experimented with, for example different choices 
of lightcurve parameterisation. Further it would be very use- 
ful to investigate methods to reduce sensitivity to biases in 
the training data. 

Finally it is perhaps useful to comment on how our 
methods compare to the winner of the SNPCC (the meth- 
ods we described in this paper finished second and third in 
the competition) which used a template-based method and 
performed very well. Our first comment is that comparison 
is hard because there was an overlap between the templates 
used by Sako and those used to generate the SNPCC, as 
described in Kcsslcr et al. (2010b I, so it is not clear how 



the method would perform on completely independent data. 
Secondly, it is not clear how the various methods would 
perform with different Figures of Merit, an important is- 
sue which we do not discuss here. It is clear that finding the 
best approach to supernova classification, and the best way 
to combine results from different classifiers, will be an active 
area of research in the coming decade. 
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Figure 22. Cumulative histograms of the residuals from the best- 
fit Hubble diagram, determined using the SNela in the represen- 
tative training sample. (Above) Residuals for the representative 
training sample. SNela are plotted in blue, with non-la SNe shown 
in red. (Below) Residuals for the boosting results. SNela that 
were correctly typed are shown in blue, with correctly typed non- 
la SNe shown in blue. SNela that were considered to be non-la 
SNe by boosting are shown in green, with incorrectly typed non-la 
SNe shown in purple. 



APPENDIX B: PROBABILISTIC 
INTERPRETATION AND COMBINATION OF 
PROBABILITIES 

By evaluating each KDE, we may obtain the probability of 
observing a lightcurve (with the lightcurve data denoted as 
x) conditioned on the supernova being a la or not, i.e. we get 
pi — p(sjla) and p2 = p(x|non-Ia). The ratio of p\ to p2 is 
known as the Bayes factor, B\2. What interests us, however, 
is the relative probability of the observation x being from a 
la supernova versus another type. That relative probability, 
called the Odds ratio, odds(x) is 



P(Ia|a;) 
P(non-Ia|:r) 
odds(x) 



Pi 



P(Ia) 



P2 



= B 



P(x) 

P (non-la) 

W) 

P(la\x) 
P(non-Ia|:r) 
P(Ia) 



piP(Ia) 
P2P(non-Ia) 



(Bl) 
(B2) 
(B3) 



' P (non-la) 



The probabilities P(Ia) and P(non-Ia) are the prior prob- 
abilities to observe a la supernova or one of another type 
respectively. 

To convert the relative probability back into absolute 
probabilities we need use the fact that there are only two 
possibilities (la or not), so that P2 = (1 — Pi). In this case 
we have that 
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Pi = odds(x)/(l + odds(x)). 



(B4) 



APPENDIX A: CROSS-VALIDATION 

Cross-validation is a statistical technique that enables one 
to tune model parameters so as to optimize model predic- 
tion. Within the context of the 21D KDE, both the kernel 
bandwidth h, the number of nearest neighbours k and the 
odds threshold may be optimized for some figure of Merit 
(FoM) by tenfold cross-validation. This entails partitioning 
the training set into 10 roughly equal parts. One may then 
use nine-tenths of the data to estimate the la and non-la 
probability densities and then use these probability densi- 
ties to classify the remaining one-tenth of the training set. 
This may be repeated ten times, predicting the class for each 
of the ten partitions of the data using the KDEs estimated 
from the remaining nine partitions. Since we know the super- 
nova types of the training set, we can then find a combina- 
tion of the aforementioned three parameters that maximizes 
the FoM. Cross-validation can be used in a similar way for 
boosting. Figure |E1| uses cross-validation to determine that 
4000 trees will be near optimal. 



If we have two independent observations x and y then 
we can update the relative probability odds(x) from obser- 
vation x: 



odds(x,y) — odds(x) 



p(y I non-la) 



(B5) 



We can use this to combine for example the probability from 
the 21 dimensional KDEs with information from the Hub- 
ble diagram, but we have to be careful if the 21D KDEs 
already contain some of the Hubble information implicitly, 
e.g. through the evolution of the overall amplitudes of the 
lightcurves as a function of redshift. 

It is possible that the KDEs should not be interpreted 
as probabilities. This may be due to oversmoothing of too 
wide kernels, or shot noise from too narrow kernels. With 
a sufficiently large training set one can test how accurately 
the KDEs represent probabilities - the proportion of SNela 
in a (calculated) odds bin should equal that predicted by 
Eq. |B4| If it is not, one can consider making a mapping 
from the calculated odds to the true odds. 

If in combining probabilities one does not want to as- 
sume independence, or does not trust the probabilities and 
doesn't want to make a mapping to true probabilities, there 
are several alternatives to Eq. |B5| Some of these include 
capping unreliable odds at 1, using linear combinations of 
odds instead of products, using p-values instead of proba- 
bilities and down-weighting particularly small/large Bayes' 
factors. Often an optimal method can be decided on by con- 



sidering a scatter plot (like Figure 20 1 of the training set. In 
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Figure CI. (Above left) At several Xi £ -R 2 we have a value 
Zi £ R, represented by a rectangle if negative and a circle if 
positive, with the size of the shape being proportional to the 
magnitude of Zj. We want to split the set of observations by X^ 1 ' 
or X^ to minimise the average in-group variance. (Above right) 
After considering all vertical and horizontal lines, we settle on 
this vertical line as our first "branching" as it minimizes in-group 
variance. (Below left) Sub-branches are chosen to minimize in-in- 
group variance. (Below right) A tree of depth 3. 



Section [4. 5| we considered two new ways of combining odds, 
equations 1 11| and |12| 



APPENDIX C: BEST TREES 

Suppose we have some data Xi 6 R 2 , Zi £ R, and we would 
like to fit Zi ~ X using a tree. To be precise, we would like 
to find a tree which minimises ^2™= 1 (T(Xi) — Zi) 2 , where 
T(Xi) = Vk when Xi falls into node k of the tree. We there- 
fore need to find two things, the tree shape and the "leaf 
values (the Vks). Figure CI illustrates the idea of "greedy" 
tree construction. Note that this may not be the best depth 
3 tree. The greedy approach ignores several potential trees. 
However it is quick and easy, and for boosting where thou- 
sands of trees are made it is not necessary to have exactly 
minimising trees at each step. See also our animation of tree 
construction on the arxiv at Cosmology at AIMS| pOlO"] ) . 
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Figure El. The score, predicted using tenfold validation, on the 
representative training sample (1.0). Seems to be no overfitting 
as more trees are added. We go with number of trees = 4000. 



APPENDIX E: GBM IN FULL 

In this appendix we complete the GBM algorithm presented 
111 Section ETTl There was no mention in Section 13.2.31 of 
the learning rate v, or the bagging fraction <j>. The learning 
rate v £ [0, 1] should appear in step 4 of the main loop. 
Originally given as, 

F k «- F k -! + T k 
step 4 should appear as, 

F k <- F k -i + vT k 

The learning rate should be set quite low, we used 0.05. 
It acts to reduce the sensitivity of F to the initial tree choice. 

The use of bagging has been shown to improve the effi- 
ciency of the GBM algorithm and the accuracy of the final 
classifier (Friedman 2002}. The idea of bagging is that in- 
stead of all the training data being used for every tree con- 
struction, a fraction (<f>) is randomly chosen to fit the tree 
at each step. For the SNPCC we used cj> — 0.5. To include 
bagging, the inner for loop should be modified to read, 

for i in {sample of size <f> ■ N from integers 1 to N} 

The last modification that needs to be made to complete 
the GBM algorithm is at step 3 of the main loop. Full line 
searches for optimal 7fc,/s are not used, instead to speed up 
the algorithm only the initial step of Newton's method is 
used: 



7*,j = 



Zi 



}^ \sh\ (2 — \zi\ 



(El) 



APPENDIX D: CALCULATING PARAMETER 
IMPORTANCE 

To measure how much information each parameter carries 
in the boosting classifier, we can do the following. For each 
branching within each tree constructed from the training 
data, calculate how much total ingroup variance was re- 
duced by this branching. Then for each parameter, for all 
the branchings which it defines add up the ingroup variance 
reductions. This value is a good indicator of a parameter's 
importance in classificaction. 



APPENDIX F: PARAMETER DISTRIBUTIONS 

In this appendix we look at how the five lightcurve fitting 
parameters and redshift differ between la's and non-la's, and 
between training and test SNe. Ia SNe cumulative frequency 
lines are red and thick, while non-la SNe are blue and thin. 
The cumulative frequency lines for training SNe are dotted, 
while the cumulative frequency lines for the unspecified SNe 
are solid. This appendix comprises Figures fl6] to [F5] 
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Figure Fl. Cumulative plots of parameter A in bands g,r,i,z. 
Non-representative training (dashed) vs unclassified (solid) and 
la (red, thick) vs non-la (blue, thin). In all bands, the magnitude 
A of SNe is far larger in the non-representative training set than 
in the unclassified set. 



Figure F4. Cumulative plot of parameter a in bands g,r,i,z. 
Non-representative training (dashed) vs unclassified (solid) and 
la (red, thick) vs non-la (blue, thin). 




Figure F2. Cumulative plot of parameter tail in bands g,r,i,z. 
Non-representative training (dashed) vs unclassified (solid) and 
la (red, thick) vs non-la (blue, thin). 





Figure F5. Cumulative plot of parameter <f> in bands g,r,i,Z. 
Non-representative training (dashed) vs unclassified (solid) and 
la (red, thick) vs non-la (blue, thin). 



APPENDIX G: ADDITIONAL FIGURES 

This appendix contains two more boosting parameter im- 
portance Figures: |G1| and |G2| 



APPENDIX H: RANDOM SNE 

This appendix contains a random selection of unclassified 
la and non-la SNe and their boosting values from represen- 
tative training. Also, had a threshold of zero been used on 
the boosting value, would the classification have been cor- 
rect (/) or incorrect (/). An extension of this appendix (200 
SNe) can be found online at|Cosmology at AIMS| (|2010 1. 
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Figure F3. Cumulative plot of parameter k in bands g,r,i,z. 
Non-representative training (dashed) vs unclassified (solid) and 
la (red, thick) vs non-la (blue, thin). 
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Figure HI. Ia SN at z = 0.64. Boosting value of 1.78. / 



Figure Gl. The importance of parameters in distingushing Ia 
from non-la in the non-representative training sample. 




Figure G2. The importance of parameters in distinguishing non- 
representative training (with spectrum) from unclassified (with- 
out spectrum) non-la SNe using boosting. 
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